Processing Wikipedia Dumps - A Case-study Comparing the XGrid and MapReduce Approaches
Abstract
We present a simple comparison of performance, measured as the total execution time taken to parse a 27-GByte XML dump of the English Wikipedia, on three different cluster platforms: Apple’s XGrid, a local Hadoop cluster of Linux workstations, and an Elastic MapReduce cluster rented from Amazon, where Hadoop is the open-source implementation of Google’s MapReduce. We show that for the selected benchmark, XGrid yields the fastest execution time, with the local Hadoop cluster a close second. The overhead of fetching data from Amazon’s Simple Storage Service (S3), along with the inability to skip the reduce, sort, and merge phases on Amazon, penalizes this platform, which is targeted at much larger data sets.
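Skipping the reduce, sort, and merge phases corresponds to running a map-only Hadoop job. The sketch below is only an illustration of that idea, not the paper's code: the class name and the naive line-oriented matching of <title> elements are our own simplifications, and the Hadoop 2.x MapReduce API is assumed. Setting the number of reduce tasks to zero makes Hadoop write mapper output directly to the output directory, with no shuffle stage at all.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical map-only job: extracts page titles from a Wikipedia XML dump.
public class WikiPageTitles {

  // Runs once per input line; emits the title of every <title> element seen.
  // Line-based matching is a simplification of real Wikipedia XML parsing.
  public static class TitleMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String s = line.toString().trim();
      if (s.startsWith("<title>") && s.endsWith("</title>")) {
        String title = s.substring("<title>".length(),
                                   s.length() - "</title>".length());
        context.write(new Text(title), new Text(""));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wiki-map-only");
    job.setJarByClass(WikiPageTitles.class);
    job.setMapperClass(TitleMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);  // map-only: no reduce, sort, or merge phase
    FileInputFormat.addInputPath(job, new Path(args[0]));    // XML dump
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

On a local cluster the input path would point at the dump stored in HDFS; on Elastic MapReduce it would typically be an S3 location, which is where the S3 fetch overhead noted above enters the picture.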
Similar resources
Identification of the Origin and Behaviour of Arsenic in Mine Waste Dumps Using Correlation Analysis: A Case Study Sarcheshmeh Copper Mine
Knowledge of the probable origin and behaviour of arsenic certainly gives valuable insights into the potential for transfer in the environment and of the risks involved in mining sites. Sequential extraction analyses are common experiments often used to study the origin and behaviour of potentially toxic elements. The method, however, presents some deficiencies, including labor-intensive proced...
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...
Matching Dispute Finder Claims to Wikipedia Articles
Dealing with large datasets is increasingly becoming a problem for natural language processing researchers. For our class project we investigate applying the open-source Hadoop MapReduce framework to the problem of information retrieval using TF-IDF.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling
It is well known that the output quality of statistical machine translation (SMT) systems increases with more training data. To obtain more parallel text for translation modeling, researchers have turned to the web to mine parallel sentences, but most previous approaches have avoided the difficult problem of pairwise similarity on cross-lingual documents and instead rely on heuristics. In contr...
Implementation and Evaluation of a Framework to calculate Impact Measures for Wikipedia Authors
Wikipedia, an open collaborative website, can be edited by anyone, even anonymously, thus becoming victim to ill-intentioned changes. Therefore, ranking Wikipedia authors by calculating impact measures based on the edit history can help to identify reputational users or harmful activity such as vandalism [4]. However, processing millions of edits on one system can take a long time. The author i...
Publication year: 2011